K-ary Clustering with Optimal Leaf Ordering for Gene Expression Data

نویسندگان

  • Ziv Bar-Joseph
  • Erik D. Demaine
  • David K. Gifford
  • Angèle M. Hamel
  • Tommi S. Jaakkola
  • Nathan Srebro
چکیده

MOTIVATION A major challenge in gene expression analysis is effective data organization and visualization. One of the most popular tools for this task is hierarchical clustering. Hierarchical clustering allows a user to view relationships in scales ranging from single genes to large sets of genes, while at the same time providing a global view of the expression data. However, hierarchical clustering is very sensitive to noise, it usually lacks of a method to actually identify distinct clusters, and produces a large number of possible leaf orderings of the hierarchical clustering tree. In this paper we propose a new hierarchical clustering algorithm which reduces susceptibility to noise, permits up to k siblings to be directly related, and provides a single optimal order for the resulting tree. RESULTS We present an algorithm that efficiently constructs a k-ary tree, where each node can have up to k children, and then optimally orders the leaves of that tree. By combining k clusters at each step our algorithm becomes more robust against noise and missing values. By optimally ordering the leaves of the resulting tree we maintain the pairwise relationships that appear in the original method, without sacrificing the robustness. Our k-ary construction algorithm runs in O(n(3)) regardless of k and our ordering algorithm runs in O(4(k)n(3)). We present several examples that show that our k-ary clustering algorithm achieves results that are superior to the binary tree results in both global presentation and cluster identification. AVAILABILITY We have implemented the above algorithms in C++ on the Linux operating system.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Fast optimal leaf ordering for hierarchical clustering

We present the first practical algorithm for the optimal linear leaf ordering of trees that are generated by hierarchical clustering. Hierarchical clustering has been extensively used to analyze gene expression data, and we show how optimal leaf ordering can reveal biological structure that is not observed with an existing heuristic ordering method. For a tree with n leaves, there are 2(n-1) li...

متن کامل

Modification of the Fast Global K-means Using a Fuzzy Relation with Application in Microarray Data Analysis

Recognizing genes with distinctive expression levels can help in prevention, diagnosis and treatment of the diseases at the genomic level. In this paper, fast Global k-means (fast GKM) is developed for clustering the gene expression datasets. Fast GKM is a significant improvement of the k-means clustering method. It is an incremental clustering method which starts with one cluster. Iteratively ...

متن کامل

Repeated Record Ordering for Constrained Size Clustering

One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into current clustering algorithms. One of the well researched constrained clustering algorithms is called microaggregation. In a microaggreg...

متن کامل

Clustering, Leaf-ordering and Visualization for Intuitive Analysis of Deoxyribonucleic-Acid Chip Data

Generally the result data from DNA chip experiments have lots of gene expression information. Scientists want to get perspective insight or want to find intuitive fact from that data. Hierarchical clustering is the most widely used method for analysis of gene expression data. In this paper, we address leaf-ordering, which is a post-processing for the dendrograms – a sort of edge-weighted binary...

متن کامل

Optimal Arrangement of Leaves in the Tree Representing Hierarchical Clustering of Gene Expression Data

In this paper, we study how to present gene expression data to display similarities by trying to find a linear ordering of genes such that genes with similar expression profiles will be close in this ordering. In general, finding the best possible order is intractable, and furthermore an unrestricted ordering may not be desired. Therefore we concentrate on the case in which hierarchical cluster...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Bioinformatics

دوره 19 9  شماره 

صفحات  -

تاریخ انتشار 2002